Are Malignant Cancer Cells Really Bigger?

10 December 2025

Ryan Mooney

The goal

Determine whether a tumor's diagnosis (benign or malignant) is associated with the size of its cells.

The method to answer this question: permutation test!


The data

The data set comes from Street, Wolberg, and Mangasarian (1993), published in Biomedical Image Processing and Biomedical Visualization, and contains data on 568 breast cancer tumor samples. It was downloaded from the UC Irvine Machine Learning Repository.
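For readers who want to reproduce the import, here is a minimal sketch. It assumes the downloaded file is a headerless, comma-separated file saved locally (the file name `wdbc.data` is an assumption) and that the column names follow the `head()` output in the next section. The helper `read_tumor_data()` is hypothetical and is not taken from the original analysis.

```r
library(readr)

# Ten features, each reported as a mean, a standard error, and a "worst" value.
features <- c("radius", "texture", "perimeter", "area", "smoothness",
              "compactness", "concavity", "concave_points", "symmetry",
              "fractal_dimension")

# 2 identifier columns + 3 x 10 feature columns = 32 columns in total.
wdbc_cols <- c("id", "diagnosis",
               paste0(features, "_mean"),
               paste0(features, "_se"),
               paste0(features, "_worst"))

# Hypothetical helper: read a headerless CSV and attach the names above.
read_tumor_data <- function(path) {
  read_csv(path, col_names = wdbc_cols, show_col_types = FALSE)
}

# Usage (the path is an assumption):
# tumor_data <- read_tumor_data("wdbc.data")
```

Passing `col_names` explicitly matters for a headerless file: with the default `col_names = TRUE`, `read_csv()` treats the first data row as the header.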

What does the data look like?

head(tumor_data)
# A tibble: 6 × 32
        id diagnosis radius_mean texture_mean perimeter_mean area_mean
     <dbl> <chr>           <dbl>        <dbl>          <dbl>     <dbl>
1   842517 M                20.6         17.8          133.      1326 
2 84300903 M                19.7         21.2          130       1203 
3 84348301 M                11.4         20.4           77.6      386.
4 84358402 M                20.3         14.3          135.      1297 
5   843786 M                12.4         15.7           82.6      477.
6   844359 M                18.2         20.0          120.      1040 
# ℹ 26 more variables: smoothness_mean <dbl>, compactness_mean <dbl>,
#   concavity_mean <dbl>, concave_points_mean <dbl>, symmetry_mean <dbl>,
#   fractal_dimension_mean <dbl>, radius_se <dbl>, texture_se <dbl>,
#   perimeter_se <dbl>, area_se <dbl>, smoothness_se <dbl>,
#   compactness_se <dbl>, concavity_se <dbl>, concave_points_se <dbl>,
#   symmetry_se <dbl>, fractal_dimension_se <dbl>, radius_worst <dbl>,
#   texture_worst <dbl>, perimeter_worst <dbl>, area_worst <dbl>, …

Benign vs Malignant Tumors (National Cancer Institute, 2001)

Major Differences: Circularity, nucleation, rigidity, size?

Hypotheses

The following hypotheses are rooted in the fact that malignant cancer cells go through the cell cycle faster, including the G1 and G2 phases, which are periods of cell growth. I therefore hypothesize that cells from malignant tumors are larger on average than cells from benign tumors.

The null hypothesis: benign tumors and malignant tumors have cells of the same average area.

The alternative hypothesis: malignant tumors have larger average cell areas.

The Variables of Interest and Test Statistic

The variables I will be investigating are diagnosis (whether the tumor is malignant or benign), designated by an ‘M’ or a ‘B’ in the diagnosis column, and mean tumor cell area, which is a calculated value for each sample in the area_mean column.

The test statistic is the difference in mean cell area between the two groups: the mean of area_mean in the malignant samples minus the mean of area_mean in the benign samples.

Let’s have a look at the observed test statistic

tumor_data |> 
  group_by(diagnosis) |> 
  summarize(ave_area = mean(area_mean))
# A tibble: 2 × 2
  diagnosis ave_area
  <chr>        <dbl>
1 B             463.
2 M             978.
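
The table gives the two group means; the test statistic itself is their difference (roughly 978 - 463 ≈ 515 here). A minimal sketch of computing it as a single number follows. The helper `obs_diff()` and the toy data are hypothetical and use made-up values so the block runs on its own.

```r
library(dplyr)

# Made-up stand-in for tumor_data; the values are illustrative only.
toy_data <- tibble(
  diagnosis = c("B", "B", "B", "M", "M"),
  area_mean = c(400, 450, 500, 900, 1000)
)

# Hypothetical helper: mean area for "M" minus mean area for "B".
obs_diff <- function(data) {
  data |>
    group_by(diagnosis) |>
    summarize(ave_area = mean(area_mean)) |>
    arrange(diagnosis) |>                      # "B" sorts before "M"
    summarize(mean_diff = diff(ave_area)) |>   # second row minus first row
    pull(mean_diff)
}

obs_diff(toy_data)
# 500 for the toy values (950 - 450)
```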

So, it looks like the mean area of malignant tumor cells is larger than that of benign tumor cells (978 vs. 463, a difference of about 515). However, could a difference this large arise from chance alone? Off to the permutation test!

The Permutation Test

set.seed(47)
perm_data <- function(rep, data) {
  data |>
    select(diagnosis, area_mean) |>
    mutate(area_perm = sample(area_mean, replace = FALSE)) |>
    group_by(diagnosis) |>
    summarize(
      obs_mean  = mean(area_mean),
      perm_mean = mean(area_perm)) |>
    summarize(
      obs_mean_diff  = diff(obs_mean),
      perm_mean_diff = diff(perm_mean),
      rep = rep
    )
}

# Run 1,000 permutations and store the results for later use
perm_stats <- map(1:1000, perm_data, data = tumor_data) |> 
  list_rbind()
perm_stats
# A tibble: 1,000 × 3
   obs_mean_diff perm_mean_diff   rep
           <dbl>          <dbl> <int>
 1          515.         30.9       1
 2          515.        -52.8       2
 3          515.         63.0       3
 4          515.         24.5       4
 5          515.         -4.64      5
 6          515.          0.188     6
 7          515.         -6.70      7
 8          515.          9.85      8
 9          515.         45.6       9
10          515.        -55.7      10
# ℹ 990 more rows
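
To make the logic of `perm_data()` concrete: shuffling `area_mean` breaks any link between area and diagnosis while keeping the group sizes fixed, so each shuffled difference is a draw from the distribution expected under the null hypothesis. A compact base-R sketch of the same idea follows. The function `perm_diffs()` and the toy vectors are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: returns n_perm permuted differences in means (M minus B).
perm_diffs <- function(area, diagnosis, n_perm = 1000) {
  replicate(n_perm, {
    shuffled <- sample(area)   # permute the areas without replacement
    mean(shuffled[diagnosis == "M"]) - mean(shuffled[diagnosis == "B"])
  })
}

# Made-up toy values, for illustration only.
set.seed(47)
area      <- c(400, 450, 500, 420, 900, 1000, 950)
diagnosis <- c("B", "B", "B", "B", "M", "M", "M")

null_diffs <- perm_diffs(area, diagnosis, n_perm = 200)
summary(null_diffs)
```

Because the labels stay in place and only the areas move, every permutation has the same number of "M" and "B" samples as the original data.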

Visualizing the null distribution
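
A minimal sketch of how the null distribution could be plotted with ggplot2, with the observed difference marked for comparison. The data frame `toy_stats` is a simulated stand-in for `perm_stats` (the standard deviation of 30 is arbitrary), so the exact plotting code and appearance of the original figure are assumptions.

```r
library(ggplot2)

# Simulated stand-in for perm_stats so the sketch runs on its own;
# in the analysis, use the tibble produced by the permutation step.
set.seed(47)
toy_stats <- data.frame(
  perm_mean_diff = rnorm(1000, mean = 0, sd = 30),
  obs_mean_diff  = 515
)

null_plot <- ggplot(toy_stats, aes(x = perm_mean_diff)) +
  geom_histogram(bins = 30) +
  # Vertical line marks the observed difference in means.
  geom_vline(xintercept = toy_stats$obs_mean_diff[1], color = "red") +
  labs(x = "Permuted difference in mean area (M - B)", y = "Count")

null_plot
```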

What’s the p-value?

# Proportion of permuted differences at least as large as the observed one
perm_stats |> 
    summarize(p_val = mean(perm_mean_diff >= obs_mean_diff))
# A tibble: 1 × 1
  p_val
  <dbl>
1     0
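
A p-value of exactly 0 reflects the finite number of permutations, not a truly impossible result. One common convention counts the observed arrangement as one of the permutations, which bounds the estimate away from zero. A minimal sketch of that convention follows; `perm_p_value()` is a hypothetical helper and is not part of the original analysis.

```r
# Hypothetical helper: one-sided permutation p-value with the +1 correction.
perm_p_value <- function(perm_diffs, obs_diff) {
  (sum(perm_diffs >= obs_diff) + 1) / (length(perm_diffs) + 1)
}

# With 1,000 permutations and none reaching the observed value,
# the smallest reportable p-value is 1 / 1001.
perm_p_value(rep(0, 1000), 515)
```

Under this convention, the result above would be reported as roughly 0.001 instead of 0.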

Conclusions

The permutation test yielded a p-value of 0: none of the 1,000 random permutations of the data produced a difference in mean cell area as large as the observed difference between malignant and benign breast tumors (515.479). With 1,000 permutations, this is best read as p < 0.001 and not as exactly 0. It provides very strong evidence against the null hypothesis that benign and malignant tumors have the same average cell size. Thus, I claim that, in general, malignant breast cancer cells are larger on average than benign breast tumor cells.

Implications

This means that average cell size could serve as a quantitative metric for the rapid and automated classification of tumor malignancy.

References

“Normal and Cancer Cells Structure: Image Details.” NCI Visuals Online, National Cancer Institute, (2001). visualsonline.cancer.gov/details.cfm?imageid=2512.

Street, W.N., Wolberg, W.H., & Mangasarian, O.L. “Nuclear feature extraction for breast tumor diagnosis.” (1993) Proc. SPIE 1905: Biomedical Image Processing and Biomedical Visualization. https://doi.org/10.1117/12.148698

Wolberg, W., Mangasarian, O., Street, N., & Street, W. “Breast Cancer Wisconsin (Diagnostic)” (1993) UCI Machine Learning Repository. https://doi.org/10.24432/C5DW2B